Statistical Inference

Brian Gerber and Marc Kéry

Outline

  • Two types of Inference
  • Probability
  • Likelihood
  • Bayesian






Two types of Inference

Two types of Inference

Design-based

  • Strong theoretical basis
  • Population quantities are fixed
  • Sampling frame defines units / population is understood
  • Probabilistic sampling of units
  • Unbiased estimators guaranteed!
  • Randomness induced by sampling process
  • Minimal assumptions!
  • Typically simple parameters

Model-based

  • Strong theoretical basis
  • Population quantities are a random realization of a stochastic generating process; ‘superpopulation’(\(\xi\))
  • Sampling frame and prob. sampling are not necessary
  • “no longer looking for a representation of a real population; looking for a mathematical structure or construction that can explain the distribution of a variable and the relationships between variables” - Raymond and Darsaut, 2025.

Model-based

  • Unbiased estimators are not guaranteed
  • Randomness is implicit in the process and assumed to follow stochastic distribution
  • Some-to-very many assumptions
  • Highly flexible / Basically Dark Magic!

Two types of Inference

  • Know your population by defining a sampling frame
  • Think about non-random samples (they may not be ignorable)
  • Use statistical models for flexibility

Statistical Modeling (In a nutshell)

  • Describes stochastic process that could produce the data
  • Observed data are just one possible realization
  • Model + data allows statistical inferences, i.e., infer features of hypothetical data-generating chance process
  • Statistical modeling: building of model and its analysis using e.g., maximum likelihood or Bayesian posterior inference

What is a statistical model?

  • An abstraction / simplification / explanation
  • Every model has a goal
    • enforce clarity of thought
    • summarize
    • search for patterns
    • understand mechanisms
    • predict

What is a statistical model?


“We are rarely explicit about the goals of our models…though should be!”


Is my model good?


Inference vs prediction

Parametric Models

Statistical Modeling Concerns

  • Many hypothetical data-generating mechanisms of stochastic process
    • leads to “messiness” of model selection
  • When sampling a large fraction of a finite population
    • e.g., sample a small finite population

\[ \hat{p} \pm t_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}\frac{N-n}{n-1}} \]






Probability

Probability

Probability as the basis for statistical inference

  • The world is uncertain; processes are often not perfectly known (i.e., not deterministic)

  • Need to draw conclusions, make decisions, or learn from observations in the face of resulting uncertainty

  • Probability: branch of mathematics dealing with chance processes and their outcomes

    • Extension of logic from certain events to all events in life
    • Basis for statistical modeling and inference

David Spiegelhalter

https://www.nature.com/articles/d41586-024-04096-5

Objectives

  • Connect random variables, probabilities, and parameters

  • define prob. functions

    • discrete and continuous random variables
  • use/plot prob. functions

  • notation!

Probability/Statistics


Probability and statistics are opposite sides of the same coin.


To understand statistics, we need to understand probability and probability functions.


The two keys to understanding this connection are the random variable (RV) and parameters (e.g., \(\theta\), \(\sigma\), \(\epsilon\), \(\mu\)).

Motivation

Why learn about RVs and probability math?

Foundations of:

  • linear regression
  • generalized linear models
  • mixed models

Our Goal:

  • conceptual framework to think about data, probabilities, and parameters
  • mathematical connections and notation

Not Random Variables

\[ \begin{align*} a =& 10 \\ b =& \text{log}(a) \times 12 \\ c =& \frac{a}{b} \\ y =& \beta_0 + \beta_1\times c \end{align*} \]

All variables here are scalars: they are what they are and that is it.

Scalars are quantities that are fully described by a magnitude (or numerical value) alone.

Random Variables

\[ y \sim f(y) \]

\(y\) is a random variable whose value may change with each observation; how it varies is governed by a probability function, \(f(y)\).


The tilde (\(\sim\)) denotes “has the probability distribution of”.


Which value of \(y\) is observed is not exactly predictable; the probabilities of the possible values are set by the parameters (\(\theta\)) of the probability function \(f(y)\).


Random Variables

\[ y \sim f(y|\theta) \]

where ‘|’ is read as ‘given’.

  • Toss of a coin
  • Roll of a die
  • Weight of a captured elk
  • Count of plants in a sampled plot

Random Variables

\[ y \sim f(y) \]

The values observed can be understood based on their frequency within the population or presumed super-population. These frequencies can be described by probabilities.

Frequency / Probabilities

Frequency / Probabilities

We often only get to see ONE sample from this distribution.

Random Variables

We are often interested in the characteristics of the whole population of frequencies,

  • central tendency (mean, mode, median)
  • variability (var, sd)
  • proportion of the population that meets some condition
    P(\(8 \leq y \leq 12\)) = 0.68

We infer what these are based on our sample (i.e., statistical inference).

Philosophy

Frequentist Paradigm:

Data (e.g., \(y\)) are random variables that can be described by probability distributions with unknown parameters (e.g., \(\theta\)) that are fixed (scalars).


Bayesian Paradigm:

Data (e.g., \(y\)) are random variables (when observed, then fixed) that can be described by probability functions where the unknown parameters (e.g., \(\theta\)) are also random variables that have probability functions that describe them.

Random Variables

\[ \begin{align*} y =& \text{ event/outcome} \\ f(y|\boldsymbol{\theta}) =& [y|\boldsymbol{\theta}]= \text{ process governing the value of } y \\ \boldsymbol{\theta} =& \text{ parameters} \\ \end{align*} \]

\(f()\) or [ ] denotes a (mathematical) function.

It is called a PDF when \(y\) is continuous and a PMF when \(y\) is discrete.

  • PDF: probability density function
  • PMF: probability mass function

Functions

We commonly use deterministic functions (indicated by non-italic letter); e.g., log(), exp(). Output is always the same with the same input. \[ \hspace{-12pt}\text{g} \\ x \Longrightarrow\fbox{DO STUFF } \Longrightarrow \text{g}(x) \]

\[ \hspace{-14pt}\text{g} \\ x \Longrightarrow\fbox{+7 } \Longrightarrow \text{g}(x) \]

\[ \text{g}(x) = x + 7 \]
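
For instance, a trivial deterministic function in R (an illustrative sketch):

g = function(x){ x + 7 }   # same input always gives the same output
g(3)
[1] 10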

Random Variables

Probability: Interested in \(y\), the data, and the probability function that “generates” the data. \[ \begin{align*} y \leftarrow& f(y|\boldsymbol{\theta}) \\ \end{align*} \]

Statistics: Interested in population characteristics of \(y\); i.e., the parameters,

\[ \begin{align*} y \rightarrow& f(y|\boldsymbol{\theta}) \\ \end{align*} \]

Probability Functions

Special functions with rules that guarantee the logic of probability is maintained.

Discrete RVs

\(y\) can only take values from a certain set.

  1. \(y \in \{0,1\}\)
    • 0 = dead, 1 = alive
  2. \(y \in \{0, 1, 2, ..., 15\}\)
    • count of pups in a litter; the max could be set by a physiological constraint

These sets are called the sample space (\(\Omega\)) or the support of the RV.

PMF

\[ f(y) = P(Y=y) \]

Data has two outcomes (0 = dead, 1 = alive)

\(y \in \{0,1\}\)

There are two probabilities

  • \(f(0) = P(Y=0)\)
  • \(f(1) = P(Y=1)\)

PMF

Axiom 1: The probability of an event is greater than or equal to zero and less than or equal to 1.

\[ 0 \leq f(y) \leq 1 \] Example,

  • \(f(0) = 0.1\)
  • \(f(1) = 0.9\)

PMF

Axiom 2: The sum of the probabilities of all possible values (sample space) is one.

\[ \sum_{\forall i} f(y_i) = f(y_1) + f(y_2) + ... = P(\Omega) =1 \] Example,

  • \(f(0) + f(1) = 0.1 + 0.9 = 1\)

PMF

Still need to define \(f()\), our PMF for \(y \in \{0,1\}\)

The Bernoulli distribution

\[ f(y|\theta) = [y|\theta]= \begin{align} \theta^{y}\times(1-\theta)^{1-y} \end{align} \]

\(\theta\) = P(Y = 1) = 0.2

\[ f(y=1|\theta) = [y=1|\theta] = 0.2^{1}\times(1-0.2)^{1-1} \]

\[ = 0.2 \times (0.8)^{0} = 0.2 \]

Bernoulli PMF

\[ f(y|\theta) = [y|\theta]= \begin{align} \theta^{y}\times(1-\theta)^{1-y} \end{align} \]

The support (\(\Omega\)):

  • \(y \in \{0,1\}\)

Parameter space support (\(\Theta\)):

  • \(\theta \in [0,1]\)
  • General: \(\theta \in \Theta\)

Bernoulli PMF (Code)

What would our data look like for 10 ducks that had a probability of survival (Y=1) of 0.20?

#Define inputs
  theta=0.2;  N=1 

#Random sample - 1 duck
  rbinom(n=1,size=N,theta)
[1] 0
#Random sample - 10 ducks
  rbinom(n=10,size=N,theta)
 [1] 1 0 0 0 1 0 1 0 1 0

Why is this useful to us?

How about using it to evaluate the sample size of ducks needed to estimate \(\theta\)?

y.mat = replicate(1000,rbinom(n = 10,size=N,theta))
theta.hat = apply(y.mat, 2, mean)
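
One way to push this further (an illustrative sketch, not from the slide): repeat the simulation at several candidate sample sizes and compare how variable \(\hat{\theta}\) is.

# Illustrative: spread of theta.hat at several candidate sample sizes
  theta = 0.2
  sample.sizes = c(10, 50, 250)
  sapply(sample.sizes, function(n){
    theta.hat = replicate(1000, mean(rbinom(n = n, size = 1, prob = theta)))
    sd(theta.hat)   # smaller SD means a more precise estimate of theta
  })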

Support

Use a probability function that makes sense for your data/RV; has the correct support.


In Bayesian inference, we also pick prob. functions that make sense for parameters.


Where do you find information on probability functions?

Normal PDF

The Normal/Gaussian distribution has as its sample space all values on the real number line.

\[y \sim \text{Normal}(\mu, \sigma) \\ y \in (-\infty, \infty) \\ y \in \mathbb{R}\]

What is the parameter space for \(\mu\) and \(\sigma\)?

Normal Distribution

We collect data on adult alligator lengths (in inches).

 [1]  90.30  83.02 103.67  85.17  99.20 106.74  90.76 105.28  99.41 101.72


Should we use the Normal Distribution
to estimate the mean?

Does the support of our data match
the support of the PDF?

Which PDF does match?

Normal Distribution

Are they exactly the same?

Normal Distribution

The issue is that when the data are near 0, we might estimate nonsensical values (e.g., negative lengths).

PDF

Continuous RVs

\(y\) can take values from an uncountable set.


Can you provide ecological data examples that match each support?

  1. Gamma: \(y \in (0,\infty)\)
  2. Beta: \(y \in (0,1)\)
  3. Continuous Uniform: \(y \in [a,b]\)

PDF

PDFs of continuous RVs follow the same rules as PMFs.

Confusing Differences

Axiom 1:

  • \(f(y) \geq 0\)

PDFs output probability densities, not probabilities.

PDF

Axiom 2:

  • Probabilities are the area between a lower and upper value of \(y\); i.e., the area under the curve

\[ y \sim \text{Normal}(\mu, \sigma) \\ f(y|\mu,\sigma ) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{y-\mu}{\sigma})^{2}} \\ \]

PDF

visualize.it(dist = 'norm', stat = c(100),
             list(mu = 100 , sd = 10), section = "upper")

PDF

PDF

The math,

\[ \int_{120}^{\infty} f(y| \mu, \sigma)dy = P(120<Y<\infty) \]

Read this as “the integral of the PDF between 120 and infinity (on the left-hand side) is equal to the probability that the outcome of the random variable is between 120 and infinity (on the right-hand side)”.


The code

pnorm(120,mean=100,sd=10,lower.tail = FALSE)
[1] 0.02275013


Or, we could reverse the question.

qnorm(0.02275,100,10,lower.tail = FALSE)
[1] 120

PDF

Axiom 3:

  • \(\int_{\text{lower support}}^{\text{upper support}}f(y)dy = 1\)

The area under the PDF across all possible outcomes (the support) is equal to 1.
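
A quick numerical check of this axiom (illustrative, using the Normal(20, 3) example below):

# The Normal(20, 3) density integrates to 1 over its support
  integrate(dnorm, lower = -Inf, upper = Inf, mean = 20, sd = 3)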

Normal Distribution (PDF Code)

y = rnorm(1000, mean = 20, sd = 3)   # simulate 1000 draws from a Normal(20, 3)
hist(y,freq=FALSE,ylim=c(0,0.14))    # empirical distribution on the density scale
lines(density(y),lwd=3,col=4)        # smoothed empirical density
curve(dnorm(x, mean= 20, sd = 3),xlim=c(0,40),lwd=4,col=3,add=TRUE)  # true PDF

Moments

Properties of all probability functions.

  • 1\({^{st}}\) moment is central tendency
  • 2\({^{nd}}\) moment is the dispersion

Normal Distribution: parameters (\(\mu\) and \(\sigma\)) are 1\({^{st}}\) and 2\({^{nd}}\) moments

Moments

Gamma Distribution: parameters are not moments

Parameters

Shape = \(\alpha\), Rate = \(\beta\)

OR

Shape = \(\kappa\), Scale = \(\theta\), where \(\theta = \frac{1}{\beta}\)

NOTE: probability functions can have Alternative Parameterizations, such that they have different parameters.

Moments are functions of these parameters:

  • mean = \(\kappa\theta\) or \(\frac{\alpha}{\beta}\)

  • var = \(\kappa\theta^2\) or \(\frac{\alpha}{\beta^2}\)
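
A quick simulation check of these formulas (an illustrative sketch, using shape \(\kappa = 10\) and scale \(\theta = 2\) as in the examples that follow):

# Compare simulated moments with the formulas kappa*theta and kappa*theta^2
  kappa = 10; theta = 2
  y = rgamma(1e6, shape = kappa, scale = theta)
  c(mean(y), kappa * theta)     # both close to 20
  c(var(y),  kappa * theta^2)   # both close to 40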

Gamma Distribution

Probability:

\[ \begin{align*} y \sim& f(y|\boldsymbol{\theta'}) \\ \end{align*} \]

\[ \begin{align*} \boldsymbol{\theta'} =& \begin{matrix} [\kappa & \theta] \end{matrix} \\ f(y|\boldsymbol{\theta}') &= \text{Gamma(}\kappa, \theta) \\ \end{align*} \]

\[ \begin{align*} f(y|\boldsymbol{\theta}') &= \frac{1}{\Gamma(\kappa)\theta^{\kappa}}y^{\kappa-1} e^{-y/\theta} \\ \end{align*} \]

Sample/parameter Support:

  • \(y \in (0,\infty)\)
  • \(\kappa \in (0,\infty)\)
  • \(\theta \in (0,\infty)\)

Gamma Distribution (PDF Code)

Gamma Distribution

What is the probability we would sample a value >40?
In this population, how common is a value >40?

\[ \begin{align*} p(y>40) = \int_{40}^{\infty} f(y|\boldsymbol{\theta}) \,dy \end{align*} \]

pgamma(q=40, shape=10, scale=2,lower.tail=FALSE)
[1] 0.004995412

Gamma Distribution

What is the probability of observing \(y\) < 20?

pgamma(q=20,shape=10, scale=2,lower.tail=TRUE)
[1] 0.5420703

Gamma Distribution

What is the probability of observing 20 < \(y\) < 40?

pgamma(q=40,shape=10, scale=2,lower.tail=TRUE)-
pgamma(q=20,shape=10, scale=2,lower.tail=TRUE)
[1] 0.4529343

Gamma Distribution

Reverse the question: what value of \(y\) is such that values at or below it have a probability of 0.025?

qgamma(p=0.025,shape=10, scale=2,lower.tail=TRUE)
[1] 9.590777

Gamma Distribution

What value of \(y\) is such that values at or above it have a probability of 0.025?

qgamma(p=0.025,shape=10, scale=2,lower.tail=FALSE)
[1] 34.16961

Gamma Distribution

curve(dgamma(x,shape=10, scale=2),xlim=c(0,50),lwd=3,
      xlab="y", ylab="dgamma(x,shape=10, scale=2)")
abline(v=c(9.590777,34.16961),lwd=3,col=2)

We can consider samples from this population,

set.seed(154434)
y <- rgamma(100, shape=10, scale=2)

What do we know about this function?

\[ f(y|\lambda) = P(Y=y) = \frac{\lambda^ye^{-\lambda}}{y!} \]

  • What is P(y = 1 | \(\lambda\) = 2) ?
dpois(1,lambda=2)
[1] 0.2706706
(2^1*exp(-2)) / (factorial(1))
[1] 0.2706706

Poisson

The full PMF (for \(\lambda\) = 2):

plot(dpois(0:10, 2), type = 'h', lend = 'butt', lwd = 10)

Poisson

Named Probability Distributions

  • Bernoulli, binomial, Poisson, normal/Gaussian or uniform distributions
    • Many, many, many more (e.g., Rayleigh, Cauchy, Pareto, von Mises)
  • Good to really know at least a few!
    • e.g., they are the building blocks of linear models, generalized linear models, and mixed models

The other side of the coin

Statistics: Interested in estimating population-level characteristics; i.e., the parameters

\[ \begin{align*} y \rightarrow& f(y|\boldsymbol{\theta}) \\ \end{align*} \]

REMEMBER

\(f(y|\boldsymbol{\theta})\) is a probability statement about \(y\), NOT \(\boldsymbol{\theta}\).







Likelihood

Objectives

  • likelihood principle
  • likelihood connection to probability function
  • optimization / parameter estimation

The other side of the coin

Statistics: Interested in estimating population-level characteristics; i.e., the parameters

\[ \begin{align*} y \rightarrow& f(y|\boldsymbol{\theta}) \\ \end{align*} \]

Estimation

  • likelihood
  • Bayesian

Likelihood

Likelihood principle

All the evidence/information in a sample (\(\textbf{y}\), i.e., data) relevant to making inference on model parameters (\(\theta\)) is contained in the likelihood function.

  • “Conceptually simple, but in practice challenging for ecologists”

The pieces

  • The sample data, \(\textbf{y}\)

  • A probability function for \(\textbf{y}\):

    • \(f(\textbf{y};\theta)\) or \([\textbf{y}|\theta]\) or \(P(\textbf{y}|\theta)\)

    • the unknown parameter(s) (\(\theta\)) of the probability function

The Likelihood Function

\[ \begin{align*} \mathcal{L}(\boldsymbol{\theta}|y) = P(y|\boldsymbol{\theta}) = f(y|\boldsymbol{\theta}) \end{align*} \]

The likelihood (\(\mathcal{L}\)) of the unknown parameters, given our data, can be calculated using our probability function.

The Likelihood Function

For example, for \(y_{1} \sim \text{Normal}(\mu,\sigma = 1)\)

CODE:

# A data point
  y = c(10)

#the likelihood the mean is 8, given our data
  dnorm(y, mean = 8)
[1] 0.05399097


If we knew the mean is truly 8, it would also be the probability density of the observation y = 10. But, we don’t know what the mean truly is.

The Likelihood Function

For example, for \(y_{1} \sim \text{Normal}(\mu,\sigma = 1)\)

The key is to understand that the likelihood values are relative, which means we need many guesses.


CODE:

#the likelihood the mean is 9, given our data
  dnorm(y, mean = 9)
[1] 0.2419707

Optimization

# many guesses of the mean
  means = seq(0, 20,by = 0.1) 
# likelihood of each guess of the mean
  likelihood = dnorm(y, mean = means, sd = 1) 
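
Continuing the example (y = 10), we can then pick out and plot the guess with the highest likelihood (an illustrative sketch):

# Plot the likelihood over the guesses and find the best one
  plot(means, likelihood, type = "l", xlab = "Guess of the mean", ylab = "Likelihood")
  means[which.max(likelihood)]   # the guess with the maximum likelihood (here 10)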

Maximum Likelihood Properties

  • Central Tenet: evidence is relative.

  • Parameters are not RVs. They are not defined by a PDF/PMF.

  • MLEs are consistent. As sample size increases, they will converge to the true parameter value.

  • MLEs are asymptotically unbiased. The \(E[\hat{\theta}]\) converges to \(\theta\) as the sample size gets larger.

  • No guarantee that MLE is unbiased at small sample size.

  • MLEs are asymptotically efficient: as the sample size gets larger, they attain the minimum variance among well-behaved estimators.

MLE with n > 1

What is the mean height of King Penguins?

MLE with n > 1

We go and collect data,

\(\boldsymbol{y} = \begin{matrix} [4.34 & 3.53 & 3.75] \end{matrix}\)


Let’s decide to use the Normal Distribution as our PDF.

\[ \begin{align*} f(y_1 = 4.34|\mu,\sigma) &= \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{y_{1}-\mu}{\sigma})^2} \\ \end{align*} \]

AND

\[ \begin{align*} f(y_2 = 3.53|\mu,\sigma) &= \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{y_{2}-\mu}{\sigma})^2} \\ \end{align*} \]

AND

\[ \begin{align*} f(y_3 = 3.75|\mu,\sigma) &= \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{y_{3}-\mu}{\sigma})^2} \\ \end{align*} \]

Need to connect data together

Or simply,

\[ \textbf{y} \stackrel{iid}{\sim} \text{Normal}(\mu, \sigma) \]

\(iid\) = independent and identically distributed

Need to connect data together

The joint probability of our data with shared parameters \(\mu\) and \(\sigma\),

\[ \begin{align*} & P(Y_{1} = y_1,Y_{2} = y_2, Y_{3} = y_3 | \mu, \sigma) \\ \end{align*} \]

\[ P(Y_{1} = 4.34,Y_{2} = 3.53, Y_{3} = 3.75 | \mu, \sigma) \]

Need to connect data together

IF each \(y_{i}\) is independent, the likelihood of our parameters is simply the product of the three probability densities,

\[ \begin{align*} =& f(4.34|\mu, \sigma)\times f(3.53|\mu, \sigma)\times f(3.75|\mu, \sigma) \end{align*} \] \[ \begin{align*} =& \prod_{i=1}^{3} f(y_{i}|\mu, \sigma) \\ =& \mathcal{L}(\mu, \sigma|y_{1},y_{2},y_{3}) \end{align*} \]

Conditional Independence Assumption

We can do this because we are assuming that knowing one observation gives us no new information about another observation.


\(P(y_{2}|y_{1}) = P(y_{2})\)

Code

Translate the math to code…

# penguin height data
  y = c(4.34, 3.53, 3.75)

# Joint likelihood of mu=3, sigma =1, given our data
  prod(dnorm(y, mean = 3,sd = 1))
[1] 0.01696987

MLE Code
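
The grid of guesses (try) and its likelihood values come from a step not shown on the slide; a minimal sketch of that assumed setup (the exact grid used on the slide may differ):

# Assumed setup: grid of (mu, sigma) guesses and the likelihood of each
  try = expand.grid(mu = seq(0, 10, by = 0.01), sigma = seq(0.01, 2, by = 0.01))
  likelihood = mapply(function(a, b) prod(dnorm(y, mean = a, sd = b)),
                      a = try$mu, b = try$sigma)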

# maximum likelihood of parameters
  try[which.max(likelihood),]
      mu sigma
925 3.85  0.36


Let's compare this to,

sum(y)/length(y)
[1] 3.873333

Likelihood plot (3D)

Sample Size

What happens to the likelihood if we increase the sample size to N=100?

Likelihood

\[ \begin{align*} \mathcal{L}(\boldsymbol{\theta}|\textbf{y}) = \prod_{\forall i} f(y_{i}|\boldsymbol{\theta}) \end{align*} \]

Is the likelihood a probability?

Likelihood

\[ \begin{align*} \mathcal{L}(\boldsymbol{\theta}|\textbf{y}) = \prod_{\forall i} f(y_{i}|\boldsymbol{\theta}) \end{align*} \]

  • Product of small numbers… makes computers sad!
  • Typically work with log-likelihood (sum of log densities)

\[ \begin{align*} \text{log}\mathcal{L}(\boldsymbol{\theta}|\textbf{y}) = \sum_{\forall i} \text{log}(f(y_{i}|\boldsymbol{\theta})) \end{align*} \]

Likelihood

Relativeness

log-likelihood

fun.log = function(a,b){
              sum(dnorm(y,mean = a, sd = b, log=TRUE))
          }

log.likelihood = mapply(a = try$mu, b = try$sigma, FUN=fun.log)
  
# maximum log-likelihood of parameters
  try[which.max(log.likelihood),]
       mu sigma
55683 3.9  0.93

Optimization Code (Numerical)

Let’s let the computer do some smarter guessing, i.e., optimization.

# Note: optim function uses minimization, not maximization. 
# WE want to find the minimum negative log-likelihood
# THUS, need to put negative in our function

neg.log.likelihood=function(par){
  -sum(dnorm(y,mean=par[1],sd=par[2],log=TRUE))
  }

#find the values that minimizes the function
#c(1,1) are the initial values for mu and sigma
fit <- optim(par=c(1,1), fn=neg.log.likelihood,
             method="L-BFGS-B",
             lower=c(0,0),upper=c(10,1)
             )

#Maximum likelihood estimates for mu and sigma
fit$par
[1] 3.901219 0.927352

Linear Regression

King Penguin Height Data (N=100)

out = lm(y~1)
summary(out)

Call:
lm(formula = y ~ 1)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.00136 -0.61581  0.01208  0.67407  2.58369 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.9012     0.0932   41.86   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.932 on 99 degrees of freedom






Bayesian Inference

All things Bayesian

  • Bayesian Inference

  • Bayes Theorem

  • Bayesian Components

  • Conjugacy

  • Hippo Case Study

  • Bayesian Computation

Andrew Gelman’s Blog

  • Decision analysis
  • Propagation of uncertainty
  • Prior information
  • Regularization
  • Combining multiple sources of information
  • Latent data and parameters.
    • When a model is full of parameters (perhaps even more parameters than data), you can't estimate them all.
  • Enabling you to go further

Probability, Data, and Parameters

What do we want our model to tell us?

Do we want to make probability statements about our data?


Likelihood = P(data|parameters)


90% CI: in the long run, 90% of intervals constructed this way will contain the true value.

Probability, Data, and Parameters

What do we want our model to tell us?

Do we want to make probability statements about our parameters?


Posterior = P(parameters|data)


Credible interval: there is a 90% probability that the true value lies within the interval, given the evidence from the observed data.

Likelihood Inference

Estimate of the population size of hedgehogs at two sites.

Bayesian Inference

Posterior Samples

 [1] 102.67671  81.11546  81.75260  87.77246  73.99043  80.70631  76.26219
 [8]  83.99927  64.74208  26.93133

Bayesian Inference

Posterior Samples

 [1] 102.67671  81.11546  81.75260  87.77246  73.99043  80.70631  76.26219
 [8]  83.99927  64.74208  26.93133

Bayesian Inference

diff=post2-post1                     # posterior samples of the difference between the two sites
length(which(diff>0))/length(diff)   # posterior probability that the difference is positive
[1] 0.8362

Likelihood Inference (coefficient)

\(y\) is Body size of a beetle species

\(x\) is elevation

summary(glm(y~x))

Call:
glm(formula = y ~ x)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.9862     0.2435   4.049 0.000684 ***
x             0.5089     0.4022   1.265 0.221093    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 1.245548)

    Null deviance: 25.659  on 20  degrees of freedom
Residual deviance: 23.665  on 19  degrees of freedom
AIC: 68.105

Number of Fisher Scoring iterations: 2

Bayesian Inference (coefficient)

Bayesian Inference (coefficient)

#Posterior Mean
  mean(post)
[1] 0.5185186


#Credible/Probability Intervals 
  quantile(post,prob=c(0.025,0.975))
     2.5%     97.5% 
-1.461887  2.403232 


#Probability of a positive effect
 length(which(post>0))/length(post)
[1] 0.709


#Probability of a negative effect
 length(which(post<0))/length(post)
[1] 0.291






Bayes Theorem

Bayes Theorem

Link

  • Marginal Probability

    • \(P(A)\)
    • \(P(B)\)
Answer:
#P(A)
3/10 = 0.3

#P(B)
5/10 = 0.5
Sampled N = 10 locations

Bayes Theorem

  • Joint Probability

    • \(P(A \cap B)\)
Answer:
#P(A and B)
2/10 = 0.2
Sampled N = 10 locations

Bayes Theorem

  • Conditional Probability

    • \(P(A|B)\)
    • \(P(B|A)\)
Answer:
#P(A|B)
2/5 = 0.4

#P(B|A)
2/3 = 0.6666
Sampled N = 10 locations

Bayes Theorem

  • OR Probability

    • \(P(A\cup B)\)
Answer:
# P(A or B)
# P(A) + P(B) - P(A and B)

0.3 + 0.5 - 0.2 = 0.6
Sampled N = 10 locations

Notice that…

\[ \begin{equation} P(A \cap B) = 0.2 \\ P(A|B)P(B) = 0.4 \times 0.5 = 0.2 \\ P(B|A)P(A) = 0.6666 \times 0.3 = 0.2 \\ \end{equation} \]

\[ \begin{equation} P(A|B)P(B) = P(A \cap B) \\ P(B|A)P(A) = P(A \cap B) \\ \end{equation} \]

\[ \begin{equation} P(B|A)P(A) = P(A|B)P(B) \end{equation} \]

Bayes Theorem

\[ \begin{equation} P(B|A) = \frac{P(A|B)P(B)}{P(A)} \\ \end{equation} \]

\[ \begin{equation} P(B|A) = \frac{P(A \cap B)}{P(A)} \\ \end{equation} \]
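
We can check this with the numbers from the sampled-locations example (illustrative):

# Bayes theorem with the example values: P(B|A) = P(A|B) P(B) / P(A)
  P.A = 0.3; P.B = 0.5; P.A.given.B = 0.4
  P.A.given.B * P.B / P.A
[1] 0.6666667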






Bayesian Components

Bayes Components

param = parameters \[ \begin{equation} P(\text{param}|\text{data}) = \frac{P(\text{data}|\text{param})P(\text{param})}{P(\text{data})} \\ \end{equation} \]

Posterior Probability/Belief

Likelihood

Prior Probability

Evidence or Marginal Likelihood

Bayes Components

param = parameters

\[ \begin{equation} P(\text{param}|\text{data}) = \frac{P(\text{data}|\text{param})P(\text{param})}{\int_{\forall \text{ Param}} P(\text{data}|\text{param})P(\text{param})} \end{equation} \]

Posterior Probability/Belief

Likelihood

Prior Probability

Evidence or Marginal Likelihood

Bayes Components

\[ \begin{equation} \text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}} \\ \end{equation} \]

\[ \begin{equation} \text{Posterior} \propto \text{Likelihood} \times \text{Prior} \end{equation} \]

\[ \begin{equation} \text{Posterior} \propto \text{Likelihood} \end{equation} \]

Bayesian Steps

The Prior

All parameters in a Bayesian model require a specified prior; parameters are random variables.

\(y_{i} \sim \text{Binom}(N, \theta)\)

\(\theta \sim \text{Beta}(\alpha = 4, \beta=2)\)

curve(dbeta(x, 4,2),xlim=c(0,1),lwd=5)

The Prior

\(\theta \sim \text{Beta}(\alpha = 1, \beta=1)\)

curve(dbeta(x, 1,1),xlim=c(0,1),lwd=5)

The Prior

  • The prior describes what we know about the parameter before we collect any data
  • Priors can contain a lot of information (informative priors) or very little (diffuse priors); there is no such thing as a ‘non-informative’ prior

  • “Well-constructed” priors can also improve the behavior of our models (computational advantage)

The Prior

Use diffuse priors as a starting point


It’s fine to use diffuse priors as you develop your model but you should always prefer to use “appropriate, well-constructed informative priors” (Hobbs & Hooten, 2015)

The Prior

Use your “domain knowledge”

We can often come up with weakly informative priors just by knowing something about the range of plausible values of our parameters.

The Prior

Dive into the literature

Find published estimates and use moment matching and other methods to convert them into prior distributions
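
For example, a simple moment-matching sketch (illustrative; the published mean and SD here are made up): convert a reported mean and SD of a survival probability into the \(\alpha\) and \(\beta\) of a Beta prior.

# Moment matching: Beta(alpha, beta) with a given mean (mu) and SD (sd)
  mu = 0.8; sd = 0.05                        # hypothetical published estimate
  alpha = mu * (mu * (1 - mu) / sd^2 - 1)
  beta  = (1 - mu) * (mu * (1 - mu) / sd^2 - 1)
  c(alpha, beta)
[1] 50.4 12.6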

The Prior

Gateway to regularization / model selection and optimal predictions

The Prior

Visualize your prior distribution

Be sure to look at the prior in terms of the parameters you want to make inferences on

The Prior

Your prior is probably not invariant to transformations

  • \(\beta \sim \text{Normal}(\mu = 0, \sigma = 100)\)
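
One way to see this (an illustrative sketch): if \(\beta\) is a coefficient on the logit scale, push draws from this “flat-looking” prior through the inverse-logit and look at the implied prior on the probability scale.

# A Normal(0, 100) prior on a logit-scale coefficient is far from flat
# on the probability scale
  beta = rnorm(1e5, mean = 0, sd = 100)
  p = plogis(beta)        # implied prior on the probability scale
  hist(p, freq = FALSE)   # piles up near 0 and 1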

The Prior

Your prior is probably not invariant to transformations

The Prior

Do a sensitivity analysis

Does changing the prior change your posterior inference?

The Prior


Are priors bad?


Do they compromise scientific objectivity?

The Prior






Conjugacy

Bayesian Model

Model

\[ \textbf{y} \sim \text{Bernoulli}(p) \]

Prior

\[ p \sim \text{Beta}( \alpha, \beta) \\ \]

These are called Prior hyperparameters

\[ \alpha = 1 \\ \beta = 1 \]

curve(dbeta(x,1,1),xlim=c(0,1),lwd=3,col=2,xlab="p",
      ylab = "Prior Probability Density")

Conjugate Distribution

Likelihood (Joint Probability of y)

\[ \mathcal{L}(p|\textbf{y}) = \prod_{i=1}^{N} P(y_{i}|p) = \prod_{i=1}^{N}p^{y_{i}}(1-p)^{1-y_{i}} \]

Prior Distribution

\[ P(p) = \frac{p^{\alpha-1}(1-p)^{\beta-1}}{B(\alpha,\beta)} \]

Posterior Distribution of p

\[ P(p|\textbf{y}) = \frac{\prod_{i=1}^{N}p^{y_{i}}(1-p)^{1-y_{i}} \times \frac{p^{\alpha-1}(1-p)^{\beta-1}}{B(\alpha,\beta)} }{\int_{p}(\text{numerator})\,dp} \]

CONJUGACY!

\[ p|\textbf{y} \sim \text{Beta}(\alpha^*,\beta^*) \]

\(\alpha^*\) and \(\beta^*\) are called Posterior hyperparameters

\[ \alpha^* = \alpha + \sum_{i=1}^{N}y_i \\ \beta^* = \beta + N - \sum_{i=1}^{N}y_i \\ \]

Wikipedia Conjugate Page

Conjugate Derivation






Case Study

Hippos

We do a small study on hippo survival and get these data…

7 Hippos Died
2 Hippos Lived

Hippos: Likelihood Model

\[ \begin{align*} \textbf{y} \sim& \text{Binomial}(N,p)\\ \end{align*} \]

# Survival outcomes of nine adult hippos
  y1=c(0,0,0,0,0,0,0,1,1)
  N1=length(y1)
  mle.p=mean(y1)
  mle.p
[1] 0.2222222

Hippos: Bayesian Model (Prior 1)

\[ \begin{align*} \textbf{y} \sim& \text{Binomial}(N,p)\\ p \sim& \text{Beta}(\alpha,\beta) \end{align*} \]

  alpha.prior1=1
  beta.prior1=1

Hippos: Bayesian Model (Prior 2)

  alpha.prior2=10
  beta.prior2=2

Hippos: Bayesian Model (Prior 3)

  alpha.prior3=150
  beta.prior3=15

Hippos: Bayesian Model (Posteriors)

\[ p|\textbf{y} \sim \text{Beta}(\alpha^*,\beta^*)\\ \alpha^* = \alpha + \sum_{i=1}^{N}y_i \\ \beta^* = \beta + N - \sum_{i=1}^{N}y_i \\ \]

# Note - the data are the same, but the prior is changing.
# Draw directly from the conjugate Beta posterior
  post.1=rbeta(10000,alpha.prior1+sum(y1),beta.prior1+N1-sum(y1))
  post.2=rbeta(10000,alpha.prior2+sum(y1),beta.prior2+N1-sum(y1))
  post.3=rbeta(10000,alpha.prior3+sum(y1),beta.prior3+N1-sum(y1))
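
A quick summary of the three posteriors (an illustrative follow-up):

# Posterior means and 95% credible intervals under each prior
  sapply(list(post.1, post.2, post.3),
         function(p) c(mean = mean(p), quantile(p, c(0.025, 0.975))))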

Hippos: Bayesian Model (Posteriors)

Hippos: Bayesian Model (Posteriors)

Hippos: Bayesian Model (Posteriors)

Hippos: More data! (Prior 1)

y2=c(0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,0,
     1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)
length(y2)
[1] 40

Hippos: More data! (Prior 2)

Hippos: More data! (Prior 3)

Hippos: Data/prior Comparison






Bayesian Computation

Markov Chain Monte Carlo

Often we don’t have conjugate likelihoods and priors, so we use MCMC algorithms to sample from posteriors.

Classes of algorithms

  • Metropolis-Hastings
  • Gibbs
  • Reversible-Jump
  • No-U-Turn Sampler (NUTS)

MCMC Sampling

  • iterations: number of samples
  • thinning: which iterations to keep; every one (1), every other one (2), every third one (3)
  • burn-in: (how many of the first samples to remove)
  • chains: (unique sets of samples; needed for convergence tests; requires different initial values)
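
To make these terms concrete, here is a minimal random-walk Metropolis sketch for the earlier hippo example (Binomial likelihood, Beta(1, 1) prior); an illustration only, not how the engines below implement it:

# Random-walk Metropolis for p in y ~ Binomial(N, p) with a Beta(1, 1) prior
  set.seed(1)
  y = 2; N = 9                                   # 2 of 9 hippos survived
  log.post = function(p) dbinom(y, N, p, log = TRUE) + dbeta(p, 1, 1, log = TRUE)
  n.iter = 10000
  p.chain = numeric(n.iter)
  p.chain[1] = 0.5                               # initial value
  for(i in 2:n.iter){
    p.prop = rnorm(1, p.chain[i - 1], sd = 0.1)  # propose a nearby value
    if(p.prop <= 0 | p.prop >= 1){ p.chain[i] = p.chain[i - 1]; next }
    accept.ratio = exp(log.post(p.prop) - log.post(p.chain[i - 1]))
    p.chain[i] = ifelse(runif(1) < accept.ratio, p.prop, p.chain[i - 1])
  }
  mean(p.chain[-(1:1000)])                       # posterior mean after burn-in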

Why did Bayesian statistics take off so late?

Many options now

Several engines that let you fit models using Bayesian MCMC techniques:

  • BUGS, WinBUGS, OpenBUGS, multiBUGS
  • JAGS
  • Stan
  • Nimble
  • also many others, e.g. greta…

Custom algorithms vs Software Engines

Should you learn how to write your own MCMC algorithms?

  • Absolutely!
  • Hell no!

What are the advantages/disadvantages?





Should you be a frequentist or a Bayesian?





Why we have become Bayesians…

Why we are not real Bayesians…

  • Seldom use informative priors
  • Plus, some inconveniences of Bayesian analysis with MCMC:
    • Take long time to run
    • Sensitivity of results to prior choice (not with ML)
    • BUGS is so flexible that you may fit nonsensical models

Very happy to use maximum likelihood as well

Conclusion on the Bayesian/frequentist choice

  • Be eclectic!
  • Usually will not use BUGS for trivial problems
  • BUGS is fantastic for more complex models (except for large data sets!)
  • BUGS language is great to actually understand a model

Algorithm of Metropolis et al. (1953)

Algorithm of Metropolis et al. (1953)





THE END